Index Policy



Learning Infinite-Horizon Average-Reward Restless Multi-Action Bandits via Index Awareness

Neural Information Processing Systems

We consider online restless bandits with average reward and multiple actions, where the state of each arm evolves according to a Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken. Since finding the optimal control is typically intractable for restless bandits, existing learning algorithms are often computationally expensive or carry a regret bound that is exponential in the number of arms and states. In this paper, we advocate index-aware reinforcement learning (RL) solutions that operate on a much smaller dimensional subspace by exploiting the inherent structure in restless bandits. Specifically, we first propose novel index policies, which are provably optimal, to address the dimensionality concern. We then leverage the indices to develop two low-complexity index-aware RL algorithms: (i) GM-R2MAB, which has access to a generative model; and (ii) UC-R2MAB, which learns the model using an upper-confidence-style online exploitation method. We prove that both algorithms achieve a sub-linear regret that is only polynomial in the number of arms and states. A key differentiator between our algorithms and existing ones is that our RL algorithms contain a novel exploitation step that leverages our proposed provably optimal index policies for decision-making.
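The decision rule shared by index policies of this kind is simple: each arm's current state is mapped to a scalar index, and the controller activates the arms with the largest indices subject to the budget. Below is a minimal sketch of that selection step; the function name, the precomputed index table, and the toy dimensions are illustrative placeholders, not the paper's GM-R2MAB/UC-R2MAB implementation.

import numpy as np

def index_policy_action(indices, states, budget):
    """Activate the `budget` arms whose current state-dependent
    index is largest; all other arms stay passive.

    indices: (num_arms, num_states) array of precomputed indices
    states:  current state of each arm
    budget:  number of arms that may be activated per round
    """
    scores = indices[np.arange(len(states)), states]
    active = np.argsort(scores)[-budget:]        # top-`budget` arms
    action = np.zeros(len(states), dtype=int)
    action[active] = 1
    return action

# Toy example: 5 arms, 3 states each, budget of 2.
rng = np.random.default_rng(0)
indices = rng.random((5, 3))
states = rng.integers(0, 3, size=5)
print(index_policy_action(indices, states, budget=2))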


Model-Based Learning of Whittle indices

Charles-Rebuffé, Joël, Gast, Nicolas, Gaujal, Bruno

arXiv.org Artificial Intelligence

We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating, and unichain Markov decision process (MDP). Our approach relies on building an empirical estimate of the MDP and then computing its Whittle indices using an extended version of an existing state-of-the-art algorithm. We provide a proof of convergence to the Whittle indices we want to learn, as well as a bound on the time needed to learn them to arbitrary precision. Moreover, we investigate its computational complexity. Our numerical experiments suggest that BLINQ significantly outperforms existing Q-learning approaches in terms of the number of samples needed to obtain an accurate approximation. In addition, its total computational cost is lower than that of Q-learning for any reasonably large number of samples. These observations persist even when the Q-learning algorithms are sped up using pre-trained neural networks to predict Q-values.
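The model-based step is conceptually a count-and-normalize estimator: tally observed transitions under each action and normalize to get an empirical MDP, to which the Whittle-index computation is then applied. A minimal sketch of that estimation step follows; `empirical_mdp` and its interface are illustrative, not BLINQ's actual API.

import numpy as np

def empirical_mdp(transitions, num_states, num_actions):
    """Estimate transition probabilities from observed
    (state, action, next_state) triples by count-and-normalize."""
    counts = np.zeros((num_actions, num_states, num_states))
    for s, a, s_next in transitions:
        counts[a, s, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Unvisited (state, action) pairs default to a uniform row.
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1),
                     1.0 / num_states)
    return P_hat

# Three observed transitions over 2 states and 2 actions.
transitions = [(0, 1, 1), (0, 1, 0), (1, 0, 1)]
print(empirical_mdp(transitions, num_states=2, num_actions=2))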


When are Kalman-Filter Restless Bandits Indexable?

Christopher R. Dance, Tomi Silander

Neural Information Processing Systems

We study the restless bandit associated with an extremely simple scalar Kalman filter model in discrete time. Under certain assumptions, we prove that the problem is indexable in the sense that the Whittle index is a non-decreasing function of the relevant belief state. In spite of the long history of this problem, this appears to be the first such proof. We use results about Schur-convexity and mechanical words, which are particular binary strings intimately related to palindromes.
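In this model the relevant belief state is the posterior variance of a scalar Kalman filter: it grows while the arm goes unobserved and contracts when the arm is measured. A sketch of that variance recursion, assuming scalar linear-Gaussian dynamics $x_{t+1} = a x_t + w_t$ with noisy observation $y_t = x_t + v_t$ (the parameter names and values are illustrative, not taken from the paper):

def variance_update(p, observed, a=0.9, q=1.0, r=1.0):
    """One step of the scalar Kalman-filter posterior variance.

    p:        current posterior variance (the belief state)
    observed: whether the arm was measured this round
    a, q, r:  state gain, process noise, and observation noise
    """
    p_pred = a * a * p + q             # prediction step
    if not observed:
        return p_pred                  # variance keeps growing
    return p_pred * r / (p_pred + r)   # measurement update shrinks it

p = 1.0
for obs in [False, False, True]:
    p = variance_update(p, obs)
    print(round(p, 3))                 # grows, grows, then drops

Indexability in the paper's sense means the Whittle index is a non-decreasing function of this belief state, so a threshold rule in the variance is consistent with the index policy.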




Multi-agent Markov Entanglement

Chen, Shuze, Peng, Tianyi

arXiv.org Machine Learning

Value decomposition has long been a fundamental technique in multi-agent dynamic programming and reinforcement learning (RL). Specifically, the value function of a global state $(s_1,s_2,\ldots,s_N)$ is often approximated as the sum of local functions: $V(s_1,s_2,\ldots,s_N)\approx\sum_{i=1}^N V_i(s_i)$. This approach traces back to the index policy in restless multi-armed bandit problems and has found various applications in modern RL systems. However, the theoretical justification for why this decomposition works so effectively remains underexplored. In this paper, we uncover the underlying mathematical structure that enables value decomposition. We demonstrate that a multi-agent Markov decision process (MDP) permits value decomposition if and only if its transition matrix is not "entangled" -- a concept analogous to quantum entanglement in quantum physics. Drawing inspiration from how physicists measure quantum entanglement, we introduce a measure of "Markov entanglement" for multi-agent MDPs and show that this measure can be used to bound the decomposition error in general multi-agent MDPs. Using the concept of Markov entanglement, we prove that a widely used class of index policies is weakly entangled and enjoys a sublinear $\mathcal{O}(\sqrt{N})$ decomposition error for $N$-agent systems. Finally, we show how Markov entanglement can be efficiently estimated in practice, providing practitioners with an empirical proxy for the quality of value decomposition.
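For small joint state spaces the decomposition error can be checked directly by brute force: compare the global value function against the sum of the local ones over every joint state. The sketch below does exactly that for two agents; it is a naive diagnostic, not the paper's Markov-entanglement estimator, and all names are illustrative.

import itertools
import numpy as np

def decomposition_error(V, V_locals, state_spaces):
    """Max over joint states s of |V(s) - sum_i V_i(s_i)|."""
    worst = 0.0
    for s in itertools.product(*state_spaces):
        approx = sum(Vi[si] for Vi, si in zip(V_locals, s))
        worst = max(worst, abs(V[s] - approx))
    return worst

# Two agents with 3 local states each; V given as a 3x3 table
# that is nearly (but not exactly) decomposable.
rng = np.random.default_rng(1)
V1, V2 = rng.random(3), rng.random(3)
V = V1[:, None] + V2[None, :] + 0.01 * rng.random((3, 3))
print(decomposition_error(V, [V1, V2], [range(3), range(3)]))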


From Restless to Contextual: A Thresholding Bandit Approach to Improve Finite-horizon Performance

Xu, Jiamin, Nazarov, Ivan, Rastogi, Aditya, Periáñez, África, Gan, Kyra

arXiv.org Artificial Intelligence

Online restless bandits extend classic contextual bandits by incorporating state transitions and budget constraints, representing each agent as a Markov Decision Process (MDP). This framework is crucial for finite-horizon strategic resource allocation, optimizing limited costly interventions for long-term benefits. However, learning the underlying MDP for each agent poses a major challenge in finite-horizon settings. To facilitate learning, we reformulate the problem as a scalable budgeted thresholding contextual bandit problem, carefully integrating the state transitions into the reward design and focusing on identifying agents with action benefits exceeding a threshold. We establish the optimality of an oracle greedy solution in a simple two-state setting, and propose an algorithm that achieves minimax optimal constant regret in the online multi-state setting with heterogeneous agents and knowledge of outcomes under no intervention. We numerically show that our algorithm outperforms existing online restless bandit methods, offering significant improvements in finite-horizon performance.
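Once the problem is cast as budgeted thresholding, the per-round decision reduces to a greedy rule: estimate each agent's benefit from intervention and act on those above the threshold, up to the budget. A minimal sketch of that selection step, with illustrative names (this is not the paper's full algorithm, which also handles the learning of the estimates):

import numpy as np

def threshold_select(benefit_estimates, threshold, budget):
    """Pick up to `budget` agents whose estimated action benefit
    exceeds `threshold`, preferring larger estimates."""
    eligible = np.flatnonzero(benefit_estimates > threshold)
    ranked = eligible[np.argsort(benefit_estimates[eligible])[::-1]]
    return ranked[:budget]

est = np.array([0.2, 0.9, 0.5, 0.7])
print(threshold_select(est, threshold=0.4, budget=2))   # -> [1 3]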


Finite-Horizon Single-Pull Restless Bandits: An Efficient Index Policy For Scarce Resource Allocation

Xiong, Guojun, Wang, Haichuan, Pan, Yuqi, Mandal, Saptarshi, Shah, Sanket, Boehmer, Niclas, Tambe, Milind

arXiv.org Artificial Intelligence

Restless multi-armed bandits (RMABs) have been highly successful in optimizing sequential resource allocation across many domains. However, in many practical settings with highly scarce resources, where each agent can receive at most one resource, such as healthcare intervention programs, the standard RMAB framework falls short. To tackle such scenarios, we introduce Finite-Horizon Single-Pull RMABs (SPRMABs), a novel variant in which each arm can be pulled only once. This single-pull constraint introduces additional complexity, rendering many existing RMAB solutions suboptimal or ineffective. To address this shortcoming, we propose using dummy states that expand the system and enforce the one-pull constraint: once an arm is activated, it transitions exclusively within the dummy states. We then design a lightweight index policy for this expanded system. For the first time, we demonstrate that our index policy achieves a sub-linearly decaying average optimality gap of $\tilde{\mathcal{O}}\left(\frac{1}{\rho^{1/2}}\right)$ for a finite number of arms, where $\rho$ is the scaling factor for each arm cluster. Extensive simulations validate the proposed method, showing robust performance across various domains compared to existing benchmarks.
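The dummy-state idea can be pictured as doubling each arm's state space: an "unpulled" copy where both actions are available, and a dummy copy entered after the single pull, from which the arm never returns. A schematic sketch of that expansion for one arm follows; it encodes the one-pull constraint described in the abstract, but the details (e.g., how a second pull is treated) are illustrative assumptions, not the paper's exact construction.

import numpy as np

def expand_single_pull(P_passive, P_active):
    """Double one arm's state space: states 0..n-1 are the original
    ('unpulled') copy, states n..2n-1 the dummy copy. The single
    pull routes the arm into the dummy copy, where it then evolves
    under passive dynamics only."""
    n = P_passive.shape[0]
    Q_passive = np.zeros((2 * n, 2 * n))
    Q_active = np.zeros((2 * n, 2 * n))
    Q_passive[:n, :n] = P_passive   # unpulled copy, passive action
    Q_passive[n:, n:] = P_passive   # dummy copy is closed under itself
    Q_active[:n, n:] = P_active     # the single pull enters the dummies
    Q_active[n:, n:] = P_passive    # assumption: a second pull acts passively
    return Q_passive, Q_active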


Lagrangian Index Policy for Restless Bandits with Average Reward

Avrachenkov, Konstantin, Borkar, Vivek S., Shah, Pratik

arXiv.org Artificial Intelligence

We study the Lagrangian Index Policy (LIP) for restless multi-armed bandits with long-run average reward. In particular, we compare the performance of LIP with that of the Whittle Index Policy (WIP), both heuristic policies known to be asymptotically optimal under certain natural conditions. Although their performances are very similar in most cases, LIP continues to perform very well in the cases where WIP performs poorly. We then propose reinforcement learning algorithms, both tabular and NN-based, to obtain online learning schemes for LIP in the model-free setting. The proposed reinforcement learning schemes for LIP require significantly less memory than the analogous schemes for WIP. We calculate analytically the Lagrangian index for the restart model, which describes optimal web crawling and the minimization of the weighted age of information. We also give a new proof of asymptotic optimality in the case of homogeneous bandits as the number of arms goes to infinity, based on exchangeability and de Finetti's theorem.
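Computationally, a Lagrangian index can be obtained by solving a single-arm average-reward MDP in which activation is charged a price $\lambda$, and taking the activation advantage $Q_\lambda(s,1) - Q_\lambda(s,0)$. A minimal sketch using relative value iteration, assuming a unichain, aperiodic arm; the model and the value of $\lambda$ here are illustrative inputs, whereas in the paper $\lambda$ is tied to the budget through the Lagrangian relaxation.

import numpy as np

def lagrangian_index(P, R, lam, iters=2000):
    """Relative value iteration for one arm, average-reward criterion.

    P: (2, n, n) transitions for the passive/active actions
    R: (2, n) rewards; activation is charged the price `lam`
    Returns Q(s,1) - Q(s,0), the index of each state at price `lam`.
    """
    n = P.shape[1]
    h = np.zeros(n)                     # relative value function
    for _ in range(iters):
        Q = np.stack([R[0] + P[0] @ h,
                      R[1] - lam + P[1] @ h])
        h_new = Q.max(axis=0)
        h = h_new - h_new[0]            # anchor for average reward
    return Q[1] - Q[0]

# Random 3-state arm with two actions and activation price 0.3.
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(3), size=(2, 3))
R = rng.random((2, 3))
print(lagrangian_index(P, R, lam=0.3))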